Practical Issues in Neural Network Training: Regularization
Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters.
In the previous example, if we constrain the vector W̄ to have only one non-zero component out of five,
the learning algorithm will correctly obtain the solution [2, 0, 0, 0, 0]. Models whose parameters have smaller absolute values also tend to overfit less.
Since it is hard to constrain the values of the parameters directly, the softer approach of adding the penalty λ||W̄||^p to the loss function is used. The value of p is typically set to 2, which leads to Tikhonov regularization.
In general, the squared value of each parameter (multiplied with the regularization parameter λ>0) is added to the objective function.
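As a concrete sketch, the following snippet shows a squared-error objective with the λ-weighted squared penalty added. It assumes a simple linear model ŷ = W̄·X̄; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    """Squared loss plus an L2 (Tikhonov) penalty.

    W: weight vector; X: data matrix (one instance per row);
    y: observed targets; lam: regularization parameter lambda > 0.
    Illustrative sketch -- names are assumptions, not from the text.
    """
    y_hat = X @ W                       # linear predictions
    errors = y - y_hat                  # per-instance error (y - y_hat)
    data_loss = 0.5 * np.sum(errors ** 2)
    penalty = lam * np.sum(W ** 2)      # lambda times squared parameter values
    return data_loss + penalty
```

Note that the penalty grows with the magnitude of every parameter, which is what pushes the optimizer toward solutions with smaller absolute values.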
The practical effect of this change is that a quantity proportional to λWᵢ is subtracted from the update of the parameter Wᵢ. An example of a regularized version of Equation 1.6 for mini-batch S and update step-size α > 0 is as follows:

W̄ ⇐ W̄(1 − αλ) + α Σ_{X̄∈S} E(X̄)X̄    (Equation 1.33)

Here, E(X̄) represents the current error (y − ŷ) between the observed and predicted values of training instance X̄.
One can view this type of penalization as a kind of weight decay during the updates.
Regularization is particularly important when the amount of available data is limited.
A neat biological interpretation of regularization is that it corresponds to gradual forgetting,
as a result of which "less important" (i.e., noisy) patterns are removed.
In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization.
As a side note, the general form of Equation 1.33 is used by many regularized machine learning models, such as least-squares regression, where E(X̄)
is replaced by the error function of that specific model. Interestingly, weight decay is only sparingly used in the single-layer perceptron, because it can sometimes cause overly rapid
forgetting, with a small number of recently misclassified training points dominating the weight vector; the main issue is that the perceptron criterion
is already a degenerate loss function with a minimum value of 0 at W̄ = 0 (unlike its hinge-loss or least-squares cousins).
This quirk is a legacy of the fact that the single-layer perceptron was originally defined in terms of biologically inspired updates rather than in terms of carefully thought-out
loss functions. Convergence to an optimal solution was never guaranteed other than in linearly separable cases.
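To see why the perceptron criterion is degenerate at W̄ = 0, the following sketch evaluates it directly (the function name and data are illustrative; labels y are assumed to be ±1):

```python
import numpy as np

def perceptron_criterion(W, X, y):
    """Perceptron criterion: sum over instances of max(0, -y * (W . x)).

    At W = 0 every margin y * (W . x) is exactly 0, so the loss attains
    its minimum value of 0 -- the degenerate minimum discussed above.
    Weight decay therefore pulls the perceptron toward this trivial solution.
    """
    margins = y * (X @ W)                      # signed margins, one per instance
    return np.sum(np.maximum(0.0, -margins))   # penalize only negative margins
```

By contrast, a hinge loss max(0, 1 − y(W̄·X̄)) is strictly positive at W̄ = 0, so the zero vector is not a minimizer and weight decay does not collapse the model.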
For the single-layer perceptron, some other regularization techniques, which will be discussed in the coming posts, are more commonly used.